Shareware Grab Bag.iso / 001 / pctj0486.rvw | Text File | 1986-04-12 | 7KB | 107 lines
Review of the article "Statistical Correlation", by Thomas Madron in the
April, 1986 issue of PC Tech Journal.
This article could have been a useful addition to the literature on
statistical computing methods for microcomputers and could have provided
readers with a reasonable introduction to multivariate statistical analysis.
However, it is so permeated with incorrect statistical theory and naive
computing methods that readers should be warned not to use either the text or
the program listings for guidance in writing a statistics package. I am not
quibbling over minor discrepancies or over issues that are being honestly
debated in the statistical community. Rather, I am challenging the
author's understanding of some of the fundamental concepts of multivariate
statistical computing.
Specifically, the following are significant misstatements of fact and
erroneous interpretations of statistical methods:
1. The text accompanying Figure 1 states "The correlation coefficient is the
slope of the 'best fit' straight line through these points." In fact, the
correlation coefficient equals the slope of the line times the ratio of the
standard deviations of the two variables.
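The relationship in point 1 is easy to verify numerically. A minimal sketch, using made-up data and the ordinary least-squares slope of y on x:

```python
import math

# Hypothetical data: y is roughly 3*x plus noise, so the least-squares
# slope is near 3 while the correlation must stay within [-1, 1].
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 5.9, 9.2, 11.8, 15.1]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

slope = sxy / sxx                   # least-squares slope of y on x
r = sxy / math.sqrt(sxx * syy)      # Pearson correlation coefficient

sx = math.sqrt(sxx / (n - 1))       # sample standard deviation of x
sy = math.sqrt(syy / (n - 1))       # sample standard deviation of y

# The slope is NOT the correlation; rescaling it by sx/sy recovers r.
assert abs(r - slope * sx / sy) < 1e-12
```

The slope and the correlation coincide only when the two variables have equal standard deviations, which is exactly the detail the article's Figure 1 caption glosses over.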
2. On page 128 is the statement "A coefficient of + or - 1.0 implies a
completely causal relation between two variables ... ." In fact, a unit
correlation only implies that two variables are perfectly associated and says
nothing about causal relationships. This is an extremely important
distinction that students learn in their first class in correlation.
3. The discussion of the consequences of missing data on page 130 is obscure
at best. For example, the statement "A correlation coefficient based on these
two variables can have a somewhat different meaning than if all respondents
had answered both questions" is meaningless, since the correlation is simply
computed on the sample of observations with data present on both variables.
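The pairwise-complete computation just described can be sketched as follows; the survey responses are hypothetical, with None standing in for a missing answer:

```python
import math

def pearson(xs, ys):
    # Plain two-pass Pearson correlation on complete data.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

# Hypothetical responses to two questions; None marks a missing answer.
q1 = [2.0, 4.0, None, 6.0, 8.0, 5.0]
q2 = [1.0, 3.0, 7.0, None, 9.0, 4.0]

# Keep only respondents who answered both questions, then correlate.
pairs = [(a, b) for a, b in zip(q1, q2) if a is not None and b is not None]
xs, ys = zip(*pairs)
r = pearson(list(xs), list(ys))
```

The correlation is simply computed on the four respondents with both answers present; nothing about its meaning changes.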
4. The description of Figure 5 is incorrect and incomplete. The title
"Sample Correlation Matrix" is wrong, since Figure 5 is a contingency table
display of the frequencies of occurrence of the responses to the two
questions. While the rows and columns of Figure 5 are never described, the
text implies that "3" represents missing data and the valid responses are "1"
and "2". In that case, Pearson's product moment correlation is entirely
inappropriate to describe the association between two dichotomous variables,
since it is used to measure the association between continuous variables.
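One conventional measure of association for a 2x2 table of dichotomous responses is the phi coefficient, computed directly from the cell frequencies. A sketch with hypothetical cell counts:

```python
import math

# Hypothetical 2x2 contingency table for two yes/no questions:
#            Q2=yes  Q2=no
# Q1=yes        a      b
# Q1=no         c      d
a, b, c, d = 30, 10, 5, 25

# Phi coefficient: association between two dichotomous variables.
phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
```

Like a correlation, phi lies between -1 and +1, but it is defined for exactly this kind of frequency table.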
5. On page 130 the statement "CORL.FOR is a linear analysis, finding a linear
least-squares fit and performing a linear transformation to normalize data
around 0" is completely incorrect and reveals the author's total ignorance of
the subject, since neither correlation nor linear least-squares normalizes
data about anything. If one wanted to normalize the data around 0, one could
subtract the mean and divide by the standard deviation to transform each
observation to a standard normal deviate.
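That standardizing transformation amounts to one line of code. A sketch with hypothetical observations:

```python
import math

data = [12.0, 15.0, 9.0, 14.0, 10.0]    # hypothetical observations
n = len(data)
mean = sum(data) / n
sd = math.sqrt(sum((v - mean) ** 2 for v in data) / (n - 1))

# Subtract the mean and divide by the standard deviation: each value
# becomes a standard deviate, centered around 0 with unit spread.
z = [(v - mean) / sd for v in data]

assert abs(sum(z)) < 1e-12              # standardized data sums to 0
```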
6. The author has obviously confused the population standard deviation with
the estimate of it based on a sample from that population. The glossary on
page 132 and the program listing on page 140 both indicate that the
denominator of the computed standard deviation is N, when, in fact, the
correct value in this case is N-1. There are some cases when N might be
justified, but the simple linear model analysis problem is not one of them.
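The N versus N-1 distinction is easy to demonstrate. A sketch with hypothetical sample values:

```python
import math

sample = [4.0, 8.0, 6.0, 2.0]           # hypothetical sample from a population
n = len(sample)
mean = sum(sample) / n
ss = sum((v - mean) ** 2 for v in sample)

sd_population = math.sqrt(ss / n)        # appropriate only for a full population
sd_sample = math.sqrt(ss / (n - 1))      # based on the unbiased variance estimate

# Dividing by N systematically understates the spread estimated from a sample.
assert sd_population < sd_sample
```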
7. On page 140 is the comment "Programs that calculate significance tests
usually need an estimate of the number of observations. Subsequent programs
use the LOWEST number of observations taken from the lower diagonal matrix as
a conservative estimate since any significance tests based on a data matrix
with missing data are suspect." Nonsense! In the first place, one does not
estimate the number of observations since one can count them exactly. What
the author probably meant to say was that in making multivariate tests of
hypotheses with missing data some adjustments may be required to the degrees
of freedom for the particular test. However, univariate tests of significance
on individual correlations with different numbers of observations are entirely
appropriate and valid.
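Such a univariate test is routine. A sketch, assuming a hypothetical correlation of 0.60 computed from the 30 respondents who answered both questions in one particular pair of variables:

```python
import math

r, n = 0.60, 30     # hypothetical correlation and its own pairwise count

# Standard t test of the hypothesis that the population correlation is 0,
# using the number of observations actually present for this pair.
df = n - 2
t = r * math.sqrt(df / (1 - r * r))
```

Each correlation in the matrix can be tested with its own n; there is no need to force every test onto the smallest pairwise count.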
In addition, the author's discussion of the computing issues involved in
calculating correlations and standard deviations on microcomputers on pages
128-129 is grossly inadequate. As shown in "Statistical Programs for
Microcomputers", by Peter A. Lachenbruch (Byte Magazine, November, 1983)
arithmetic on sums and sums of squares can be deceptively treacherous. For
example, the author's computational formula on page 129 was originally
developed for mechanical calculators to avoid the need for making two passes
through the data. However, when that formula is blindly applied to data that
is large in magnitude and has little variation, the results can be totally
unpredictable. The problem is compounded by performing the operations on sums
and sums of squares in single precision, as the author has done on page 140.
At the very least, the potentially disastrous results of accumulating
round-off errors can be moderated by performing these operations in double
precision. Also, centering the data by subtracting the means before the
correlations are computed can more than make up for the added execution time
by providing some additional protection against producing meaningless results.
These and other issues related to the potential loss of precision in
statistical microcomputing are discussed in some detail in the excellent
article by Lachenbruch.
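The failure mode of the one-pass formula is easy to reproduce, even in double precision. A sketch with hypothetical data that is large in magnitude and has little variation:

```python
# Hypothetical data: large in magnitude, tiny in variation -- exactly the
# case where the "sums and sums of squares" shortcut breaks down.
data = [1e9, 1e9 + 0.1, 1e9 + 0.2]
n = len(data)

# One-pass formula: SS = sum(x^2) - n*mean^2. The two enormous terms
# cancel catastrophically; in single precision the damage is far worse.
mean = sum(data) / n
sq = sum(v * v for v in data)
one_pass_var = (sq - n * mean * mean) / (n - 1)

# Two-pass method: center the data first, then accumulate the squares.
two_pass_var = sum((v - mean) ** 2 for v in data) / (n - 1)

# The true variance is 0.01; only the centered computation recovers it.
assert abs(two_pass_var - 0.01) < 1e-5
assert abs(one_pass_var - 0.01) > 1e-3
```

Centering costs a second pass through the data, but as the Lachenbruch article discusses, that is a small price for results that are not dominated by round-off error.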
I wasn't going to expend this effort until I saw that the author intends to
publish more articles on test reliability, stepwise multiple
regression, factor analysis, and other multivariate methods of analysis.
Based on the quality of the author's first effort, the potential for disaster
is enormous. I strongly recommend that anyone wanting to do multivariate
statistical analysis on a microcomputer seek guidance from someone
who has demonstrated at least a minimal level of competency in both
statistical methods and statistical computing. In my opinion, this author has
demonstrated neither. I am frankly amazed that the editorial process at PC Tech
Journal is so weak as to allow potentially harmful information like this into
print. PCTJ would do its readers a service by having a competent statistician
review such articles on statistics before they are published.
David N. Iklé, Ph.D.
Biostatistician
Denver, CO